Today we will build upon the graphing approaches in the with all the Data Carpentry ggplot tutorial
The Cookbook for R by Winston Chang is also great for tidying up our graphs.
Here are a couple of cheat sheets that can be useful
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color=Species, shape=Species)) +
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width", color = "Plant Species", shape = "Plant Species") https://r-charts.com/ggplot2/themes/
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color=Species, shape=Species)) +
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width", color = "Plant Species", shape = "Plant Species") +
theme_classic() ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(color = "red", aes(shape = Species))+
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species, shape = Species)) +
scale_color_manual(values=c("blue", "purple", "red")) +
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species, shape = Species)) +
scale_color_brewer(palette="Dark2") +
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") library(viridisLite)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(fill = Species), color = "black", pch=21) +
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") https://sjmgarnier.github.io/viridisLite/
library(viridisLite)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species, shape = Species)) +
scale_colour_viridis_d() +
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width") In the opening line of the RMarkdown code chunk {r} you can control the output of the code, graphs, tables using knitr syntax. For example if {r, eval = FALSE} the code will not be run, but will be shown. If {r, code = FALSE} the code will not be shown, but will be run and the output will be shown (useful in reports where the reader is only interested in the results/graphs, but not the code). You can also suppress error messages and warnings so that the reader isn’t bothered by them (but you should take notice). YOU CAN ALSO DO THIS NOW IN THE VISUAL EDITOR MODE IN RSTUDIO.
The dimensions of an individual graph in the RMarkdown document be adjusted by specifying the graph dimensions in the header for the r code chunk
```{r, fig.width = 8, fig.height = 2}
You may have realized that you can export plots in R Studio by clicking on Export in the Plots window that appears after you make a graph. You can save as a pdf, svg, tiff, png, bmp, jpeg and eps. You can also write the output directly to a file. This is particularly useful for controling the final dimensions in a reproducible way and for manuscripts.
# Plot graph to a pdf outputfile
pdf("images/iris_example_plot1.pdf", width=6, height=3)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
labs(title = "Iris Sepal Length vs Wide", x = "Sepal Length", y = "Sepal Width")
dev.off()## png
## 2
# Plot graph to a png outputfile
ppi <- 300
png("images/iris_example_plot2.png", width=6*ppi, height=4*ppi, res=ppi)
ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
dev.off()## png
## 2
For more details on sizing output Cookbook for R - Output to a file - PDF, PNG, TIFF, SVG
Sometimes it is useful in controlling the image layout for a report to file with the graph and then subsequently load it into the .Rmd file. This works with png files, but not pdfs. You can also upload images made with other bioinformatic tools into your RMarkdown report.
# This is the RMarkdown style for inserting images
# Your image must be in your working directory
# This command is put OUTSIDE the r code chunk
 Another way to present a graph without the code is adding echo = FALSE within the r{} chunk - {r echo = FALSE}. This prevents code, but not the results from appearing in the knitr file.
With plotly/ggplotly (https://plot.ly/ggplot2/) you can make interactive graphs in your lab report.
First load the libraries
Let’s load the table into R
NEON_MAGs <- read_csv("data/NEON/GOLD_Study_ID_Gs0161344_NEON.csv") %>%
# remove columns that are not needed for data analysis
select(-c(`GOLD Study ID`, `Bin Methods`, `Created By`, `Date Added`)) %>%
# create a new column with the Assembly Type
mutate("Assembly Type" = case_when(`Genome Name` == "NEON combined assembly" ~ `Genome Name`,
TRUE ~ "Individual")) %>%
mutate_at("Assembly Type", str_replace, "NEON combined assembly", "Combined") %>%
separate(`GTDB-Tk Taxonomy Lineage`, c("Domain", "Phylum", "Class", "Order", "Family", "Genus"), "; ", remove = FALSE) %>%
# Get rid of the the common string "Soil microbial communities from "
mutate_at("Genome Name", str_replace, "Terrestrial soil microbial communities from ", "") %>%
# Use the first `-` to split the column in two
separate(`Genome Name`, c("Site","Sample Name"), " - ") %>%
# Get rid of the the common string "S-comp-1"
mutate_at("Sample Name", str_replace, "-comp-1", "") %>%
# separate the Sample Name into Site ID and plot info
separate(`Sample Name`, c("Site ID","subplot.layer.date"), "_", remove = FALSE,) %>%
# separate the plot info into 3 columns
separate(`subplot.layer.date`, c("Subplot", "Layer", "Date"), "-") Remove Archaea since there is no GTDB annotation for Archaea Remove the coassembly data for today
NEON_MAGs_bact_ind <- NEON_MAGs %>%
filter(Domain == "Bacteria") %>%
filter(`Assembly Type` == "Individual") Note that in this graph ggplot produces the count automatically
Use the forcats package in tidyverse to put the counts
in descending order
#### Counts passed to ggplot
This is different code that creates the same graph as above. Note in
this case the counts were first calculated in dplyr then
passed to ggplot. Both x and y values are needed. Within
geom_bar stat is set to “identify”
NEON_MAGs_bact_ind %>%
count(Phylum) %>%
ggplot(aes(x = Phylum, y = n)) +
geom_col(stat = "identity") +
coord_flip()## Warning in geom_col(stat = "identity"): Ignoring unknown parameters: `stat`
To put in descending order
NEON_MAGs_bact_ind %>%
count(Phylum) %>%
ggplot(aes(x = reorder(Phylum, n), y = n)) +
geom_col(stat = "identity") +
coord_flip()## Warning in geom_col(stat = "identity"): Ignoring unknown parameters: `stat`
NEON_MAGs_bact_ind %>%
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Site)) +
geom_bar() +
coord_flip()NEON_MAGs_bact_ind %>%
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Site)) +
geom_bar(position = "dodge") +
coord_flip()
Notice that the bars are of different width. This can be adjusted by
setting the width
NEON_MAGs_bact_ind %>%
ggplot(aes(x = fct_rev(fct_infreq(Phylum)), fill = Site)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
coord_flip()
### Multiple panels (facet_wrap)
NEON_MAGs_bact_ind %>%
ggplot(aes(x = Phylum)) +
geom_bar(position = position_dodge2(width = 0.9, preserve = "single")) +
coord_flip() +
facet_wrap(vars(Site), scales = "free", ncol = 2)For all exercises make complete graphs that are report ready. Relabel
the x-axis, y-axis and legend for clarity, add a title, add color and
size appropriately. The NAs in the taxonomy indicate a
novel species starting with the highest level. For example a
NA in a class that has an assigned phylum
Proteobacteria would be a novel class in the phylum
Proteobacteria. To filter Class and Order based on
NA.
## # A tibble: 16 × 27
## `Bin ID` Site `Sample Name` `Site ID` Subplot Layer Date `IMG Genome ID`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 3300060709… Sant… SRER_006-M-2… SRER 006 M 2021… 3300060709
## 2 3300060731… Chas… WOOD_043-M-2… WOOD 043 M 2021… 3300060731
## 3 3300060853… Chas… WOOD_002-M-2… WOOD 002 M 2021… 3300060853
## 4 3300060884… Chas… WOOD_042-M-2… WOOD 042 M 2021… 3300060884
## 5 3300060914… Guan… GUAN_043-M-2… GUAN 043 M 2021… 3300060914
## 6 3300060915… Konz… KONZ_024-M-2… KONZ 024 M 2021… 3300060915
## 7 3300061211… Grea… ONAQ_004-M-2… ONAQ 004 M 2021… 3300061211
## 8 3300061213… Nati… CLBJ_033-M-2… CLBJ 033 M 2021… 3300061213
## 9 3300061213… Nati… CLBJ_033-M-2… CLBJ 033 M 2021… 3300061213
## 10 3300061213… Nati… CLBJ_033-M-2… CLBJ 033 M 2021… 3300061213
## 11 3300061536… Nati… CLBJ_032-M-2… CLBJ 032 M 2021… 3300061536
## 12 3300061536… Nati… CLBJ_032-M-2… CLBJ 032 M 2021… 3300061536
## 13 3300061537… Nati… CLBJ_003-M-2… CLBJ 003 M 2021… 3300061537
## 14 3300061538… Nati… CLBJ_002-M-2… CLBJ 002 M 2021… 3300061538
## 15 3300061538… Nati… CLBJ_002-M-2… CLBJ 002 M 2021… 3300061538
## 16 3300061640… Niwo… NIWO_003-M-2… NIWO 003 M 2021… 3300061640
## # ℹ 19 more variables: `Bin Quality` <chr>, `Bin Lineage` <chr>,
## # `GTDB-Tk Taxonomy Lineage` <chr>, Domain <chr>, Phylum <chr>, Class <chr>,
## # Order <chr>, Family <chr>, Genus <chr>, `Bin Completeness` <dbl>,
## # `Bin Contamination` <dbl>, `Total Number of Bases` <dbl>, `5s rRNA` <dbl>,
## # `16s rRNA` <dbl>, `23s rRNA` <dbl>, `tRNA Genes` <dbl>, `Gene Count` <dbl>,
## # `Scaffold Count` <dbl>, `Assembly Type` <chr>
What are the overall class MAG counts?
What are the MAG counts for each subplot. Color by site ID.
How many novel bacteria were discovered (Show that number of
NAs for each taxonomic level)?
How many novel bacterial MAGs are high quality vs medium quality?
What sites have novel bacterial genera?
What phyla have novel bacterial genera?
Make a stacked bar plot of the total number of MAGs at each site using Phylum as the fill.
Using facet_wrap make plots of the total number of MAGs
at each site for each phylum (e.g. similar to the example above but
using the site ID and separating each graph by phylum.)
What is the relationship between MAGs genome size and the number of genes? Color by Phylum.
What is the relationship between scaffold count and MAG completeness?